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Abstract 

We discuss a general technique that can be used to form a differentiable bound on the optima 
of non-differentiable or discrete objective functions. We form a unified description of these methods 
and consider under which circumstances the bound is concave. In particular we consider two concrete 
applications of the method, namely sparse learning and support vector classification. 

1 Optimization by Variational Bounding 

We consider the general problem of function maximization, max/(a;) for vector x. When / is differentiable 

and X continuous, optimization methods that use gradient information are typically preferred over non- 
gradient based approaches since they are able to take advantage of a locally optimal direction in which 
to search. However, in the case that / is not differentiable or x is discrete, gradient based approaches are 
not directly applicable. In that case, alternatives such as relaxation, coordinate-wise optimization and 
stochastic approaches are popular [1] . Our interest is to discuss another general class of methods that yield 
differentiable surrogate objectives for discrete x or non-differentiable /. The Variational Optimization 
(VO) approach is based on the simple bounc(3 

r =max/(a;)>(/(x))^(^|,)^i?(0) (1) 

where (•)p denotes expectation with respect to the distribution p defined over the solution space C. The 
parameters 9 of the distribution p{x\9) can then be adjusted to maximize the lower bound E{9). This 
bound can be trivially made tight provided the distribution p{x\9) is flexible enough to allow all its mass 
to be placed in the optimal state x* = argmaxj; f{x). 

Under mild restrictions, the bound is differentiable, see scction dl.ip . and the bound is a smooth 
alternative objective function (see also section (|4.ip on the relation to 'smoothing' methods). The degree 
of smoothness (and the deviation from the original objective) increases as the dispersion of the variational 
distribution increases. In section (ll.2p we give sufficient conditions for the variational bound to be concave. 
The purpose of this paper is to demonstrate the ease with which VO can be applied and to discuss its 
merits as a general way to construct a smooth alternative objective. 

1.1 Differentiability of the variational objective 

When f{x) is not differentiable, under weak conditions E{9) can be made differentiable. The gradient of 
E{9) is given by 



dE _ d 
'89 ~ 89 



c 



f{x)p{x\9)dx. (2) 



^Note that this provides a bound on the optimum of f{x), not necessarily on f{x) itself. 
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We can bring the differential under the integral sign provided: 

(i) f{x)p{x\9) is Lebesgue integrable and differentiable with respect to 9 

(ii) there exists an integrable function F : C ^ M. such that, for all 9, 

< F{^) (3) 

These weak conditions mean that for a large-class of problems in which / is non-differentiable or x 
discrete, the bound E{9) is differentiable with respect to the parameters 9. For example, consider the 
non-differentiable objective f{x) = I [a; > 0] with x normally distributed with mean and unit variance, 
p{x\9) =Af{x\9,l). Then |f = exp(-6lV2)/v^. 



^f{x)p{x\9) 



1.2 Concavity of the variational bound 

Here we consider the special case of continuous x and non-differentiable f{x) and describe conditions 
which are sufficient for E{9) to be concave in 60. To do this we take advantage of a recent result by [2] to 
prove the concavity of a Kullback-Leibler based variational approximation for generalized linear models. 
We first introduce the general concept of an expectation affine distribution. 

Definition (Expectation affine) A distribution p{x\9) is expectation affine if, for linear functions a, /3, 
distribution q{z) and function /, 

(/(-))p(.|.) = (/("W- + /5W)),, (4) 

Theorem 1 Let f{x) be a concave function and p{x\9) an expectation affine distribution. Then E{9) = 
{fi^))p{x\e) concave in 9. 

Proof Defining A = 1 — A and using the fact that p is expectation affine, 

E{X9i -f A02) - (/(a(A^?i + \92)z + /3(A0i + \92)) ) (5) 

\ / q{z) 

= (f(x{a z + I39i) + A(a (^2) z + /3(02))) \ (6) 

\ ^ ^ / q{z) 

Since / is concave, then 

f{Xxi + ~Xx2)>Xf{x,) + Xf{x2) (7) 
Hence 

E{X9i + ~X92) > X(f{a{9,)z + P{9,)) ) + x(f{a{92)z + P{92)) ) (8) 

\ / q{z) \ I q(z) 

= XE{9i) + XE{92) (9) 



1.2.1 Concavity for Gaussian p 

As an example, consider a multivariate Gaussian distribution p{x\pi^C) with mean /i and covariance 
E = CC^, where C is the Cholesky factor. Then using the change of variables z — {x — /i) we have 

(/(-))p(.|..c)-(/(^- + /^))aA(.|o./) (10) 

^The variational objective E{9) = (/(a;))q(a;j6i) is bounded above and below by the maximum and minimum of f{x). 
Where the minimum is finite then, E{9) cannot be strictly concave over all parameter space 6 G R^. This prevents us from 
forming strictly convex problems for fully unconstrained 9 whenever / has finite minimum. However, even in this case, the 
bound may be quasi-concave (uni- modal). 
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Fi gure 1: (e) The solid line represents the function \w\ and the dashed lines the bound 
plotted as a function of fj, for three different values a e {0.1,0.5, 1} (from bottom to top in the figure), 
(b) The lasso objective function f{w) plotted for A = 2, \ = 4, b = —1, c = 0. In this case the optimal 
setting for w based on the lasso objective is w — 0. However, the optimal setting that minimizes the 
upper bound is slightly larger than zero. This means that the bound does not force the weights to exactly 
zero, (c) The lasso objective for A = 2, A = 4, 6 = —7, c = 0. In this case the optimal setting for w for 
both the lasso objective and the variational bound arc non-zero and numerically very close. This is to 
be expected since, as we move further away from the origin, the variational approximation more closely 
matches the objective, resulting in a closer match of their optima. 



That is, we can replace the average w.r.t. Af {x\fi, E) by an average with respect to the standard multi- 
variate normal Af {z\0, 1) by using a linear transformation of parameters. To show that this is expectation 
affine, for I?-dimensional vector x and Gaussian p{x\iJ,,C) we define the D + D{D + l)/2 dimensional 
parameter vector 9 which stacks the mean /i and lower triangular elements of C . By defining a and /3 as 
linear projections that select the C and components respectively of 0, we see that the multivariate Gaus- 
sian is expectation affine. Hence, provided f{x) is concave, then E[9) = (/(^))p(a;|^ c) jointly-concave 
in (/i,C). 



1.3 Bound optimization 

VO provides a differentiable bound, whose optimum will still need to be found numerically. Since the 
bound if differentiable, any standard gradient-based method can be applied, such as Newton or Quasi- 
Newton methods. In the following sections, we discuss two example applications, the lasso and SVM 
objectives, both of which result in convex bound minimization problems. For the lasso problem we chose 
to use a simple diagonal approximation to Newton's method to optimize the bound since this is simple 
to implement and often surprisingly effective. Naturally, more sophisticated procedures could be applied, 
though this is not the main focus of our interest here. For the SVM we chose L-BFGS, though again there 
are many alternatives that could be considered. 



2 Sparse Least Squares Regression 

The lasso is a well-known least squares regression estimator that encourages sparse solutions through a 
one- norm regularizer For Z3-dimensional inputs x", outputs y", n = 1, . . . , A^, and positive regularising 
constant A, the lasso objective is to minimize 

N D D 

f{w) = ^ (y" - w'^x^'f + A ^ |m,| = c + w'^b + w'^ Aw + A ^ (11) 

n— 1 i=l i—1 

with 

c = ^(2/")^ 6=-2^2/"a;", and AEE^a;"x""^. (12) 
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The one-norm term \wi\ is non-differentiable at the origin and hence standard gradient based opti- 
mization algorithms cannot be directly applied. Section 11.21 shows that the variational bound conserves 
the convexity of an objective if a Gaussian (or more generally, expectation affine) distribution is used. 
Since the lasso objective is convex in w it is natural to consider a Gaussian variational distribution 
p{w\9) = J\f {w\^, S) with mean fi and covariance S. This gives the upper bound /* < C), where 

Eit^^ C) - E ((y" - + (I-'I)aA(.,..^„) ■ (13) 

From Theorem [U this is jointly convex in (/i, C) where E — CC~^ , the Cholesky decomposition. It is 
straightforward to show that 



where (jiix) — e ^ / y/27rdy is the cumulative density function of the standard normal distribution. 
More explicitly, 

E{ii,C) = c + fi^b + fi^Afi + trace (A^) + ^fi^(^l - 2(j)(^- +2--^e"^j (15) 

where af = Ttu. The gradient and Hessian w.r.t fi are straightforward to compute using the results 

Since the bound is non-differentiable for = 0, we use a fixed isotropic covariance T, = a'^I with a > 
and minimize the bound with respect to fi alone. 

2.1 Bounding the error 

The difference between the bound and the function is given by (see Figure ([Ij) 

E{fi) - fifi) = trace (AS) + XJ2^^ (^^^ (1 - 20 {-z^)) + ^fe"* - (17) 

where Zi = fii/ai. This splits into a set of independent terms, each of which has its maximum at Zi = 0, 
i.e. fii — 0, giving a maximal error of trace (AS) + -^=^iO-i. Hence to guarantee to find a surrogate 
objective whose optimum value is within of the true lasso optimal value we must use, for E ~ <y^I, 



a< , ^ = ( J( ^^^-fAftrace(A) ) I . (18) 

- ^trace(A) V J V2^ J 

Hence, provided that a is set small enough and assuming that the optimum of the bound is found 
accurately, we can guarantee to find the optimum value of the lasso objective to within some specified 
tolerance. 
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Small problems - D = 50 



Algorithm 



Variational optimization 
Shooting 

Iterated ridge regression 
Smoothing (integral sigmoid) 
Smoothing (%/ + e) 
Projection 
Sub- gradients 



Algorithm 



Variational optimization 
Shooting 

Iterated ridge regression 
Smoothing (integral sigmoid) 
Smoothing + e) 

Projection 
Sub-gradients 



Solution time (SD) 


Relative Error (SD) 


0.1269 (0.0680) 


3.791 


X 


10- 


-14 


(5.62 


X 


10" 


-14) 


0.0142 (0.0037) 


2.485 


X 


10" 


-16 


(2.21 


X 


10" 


-16) 


0.0268 (0.0208) 


1.851 


X 


10" 


-10 


(6.27 


X 


10" 


-10) 


0.1088 (0.0603) 


4.421 


X 


10" 


-15 


(3.06 


X 


10" 


-15) 


0.0830 (0.0178) 


4.916 


X 


10" 


-15 


(1.62 


X 


10" 


-14) 


0.0224 (0.0107) 


2.672 


X 


10" 


-16 


(1.80 


X 


10" 


-15) 


0.0472 (0.0180) 


9.176 


X 


10" 


-10 


(8.31 


X 


10" 


-09) 



Larger problems - D = 200 



Solution time (SD) 


Relative Error (SD) 


0.1308 (0.0405) 


1.923 


X 


10" 


-15 


(6.01 


X 


10" 


-16) 


0.0676 (0.0209) 


2.656 


X 


10" 


-16 


(3.17 


X 


10" 


-16) 


0.1645 (0.0844) 


4.004 


X 


10" 


-14 


(1.19 


X 


10" 


-13) 


1.2305 (0.3381) 


2.638 


X 


10" 


-15 


(9.19 


X 


10" 


-16) 


1.1385 (0.3673) 


1.471 


X 


10" 


-15 


(1.86 


X 


10" 


-15) 


0.3121 (0.0726) 


5.325 


X 


10" 


-16 


(2.85 


X 


10" 


-16) 


1.3478 (0.3273) 


6.038 


X 


10" 


-11 


(9.42 


X 


10" 


-10) 



Table 1: Performance and time taken (in seconds) for algorithms solving lasso problems of two different 
sizes. All algorithms were implemented in MATLAB and run on a 2.27GHz 4GB machine. The averages 
and the standard deviations result from 500 experiments. 



2.2 Lasso Experiments 

We first generated a sparse ZJ-dimensional parameter vector with components Ui sampled according to 

{0 with probability 0.5 

A/" (uj 1 5, 1) with probability 0.25 (19) 
N {ui \ - 5, 1) with probability 0.25 

We then created a set of = lOD training points in _D-dimensional space from the standard multivariate 
normal distribution N (x\Q,I). The clean labels for these points were given by — vJx" and the final 
noisy labels by y" = + e" where is Gaussian random noise with mean zero and standard deviation 
^■^W X]n=i 1^0 I' chosen to obscure but not dominate the clean labels. We set A — SOD to give solutions 
with sparsity roughly the same as the initial parameter vector u. 

At each iteration we fixed S and updated /i using a diagonal approximation to the Newton method 
(to avoid the cost of inverting the full Hessian). For gradient E'^ and Hessian elements i?-- the updates 
were 

Mr"=Mf'-0.1^. (20) 

There is a trade-off between using a large a, resulting in a bound whose optimum can be numerically 
obtained easily versus using a small a for which the optimum of the bound is more difficult to find 
numerically, but for which the resulting bound optimum more closely matches the optimum of the true 
objective. Our experience is that in practice it is therefore best to start iterative procedures that optimize 
the bound with a relatively large a, reducing this with each iteration. For this particular set of experiments 
we used an initial covariance matrix of 0.1 times the identity and reduced it by a factor of 0.9 at each 
iteration. The initial estimate for was a vector of zeros. We terminated when the mean absolute 
difference in the solution elements between iterations was less than 10^^^ of the mean absolute value of 
the elements of the current solution and took this terminal ^ to be the estimate for argmin^ /{"w)- 

We tested our method against a number of standard lasso solvers in Mark Schmidt's minFunc MATLAB 
packag^ to gain an indication of its performance relative to established approaches. Since the true optima 

^www.di . ens . f r/~mschmidt/Sof tware/minFunc .html 
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are not known, we measured the success of the algorithms relative to the best solution /best, found by 
any of the algorithms on each problem instance. In Table [T] we give the mean and standard deviation of 
relative errors (/ — /best) /(/best)- The results indicate that VO is capable of approximating the global 
optimum in moderate time and that it scales well with problem size. In particular, VO provides solutions 
of similar quality to other smoothed methods in [3] (the first based on approximating the Li norm using 
an integral of two sigmoid functions and the second using \/ x"^ + e where e is some positive constant j^. 



2.3 Fused Lasso Sparse Regression 

From table (HI), the standard lasso problem is most effectively solved by the shooting algorithm 5 . Shoot- 
ing corresponds to solving each component whilst keeping the others fixed, and subsequently cycling 
through components to convergence. This is particularly effective in the standard lasso problem since the 
corresponding one-dimensional optima for each component can be found in closed form; also the objec- 
tive only weakly couples the components of the vector w. In contrast, the fused lasso problem induces 
additional sparsity between adjacent elements using the objective: 

D D 

f{w) = c + w'^b + w'^Aw + \i^\wi \ + \2^\wi - (21) 

2=1 i=2 

The additional term also introduces strong dependencies between adjacent components and component- 
wise ('shooting') methods struggle to converge [B]. In applying VO, with S — a^I, the additional terms 

{\w.i - Wi-i|)7v-(^|^^^2/) in the bound are given by: 

As for the standard lasso case, we can readily obtain the gradient and Hessian of the objective. 



2.4 Fused Lasso Experiments 

We set the Z?-dimensional parameter vector so that there would be some real underlying sparsity in 
the differences between elements of the optimum w. The first element in the parameter vector ui was 
generated as in the previous problem, then all subsequent elements were generated by: 

{Ui^i with probability 0.5 

with probability 0.25 . . 

A/'(m,|5, 1) with probability 0.125 

A/" (mj I - 5, 1) with probability 0.125 

We again created a set of = 101? training points in Z3-dimensional space from the standard multivariate 
normal distribution. The clean labels for these points were then given by i/q — vJx"^ and the final noisy 
labels by y" = Uq + where e" is random noise with mean zero and standard deviation O-l-^ X]^=i IVol- 
We ran a, D = 500 experiment 1000 times with Ai = 500 and A2 = 200. These values were chosen 
to give rise to a problem in which the loss function had a significant impact but did not enforce more 
sparsity than was typically present in the original parameter vector u. The initial standard deviation was 
0.1, and we shrunk it by 0.9 at each iteration. For comparison we used the SLEP package [7|, which has 
very competitive performance compared to other state-of-the-art solvers. The SLEP package is based on 
a version of Nesterov's method [8], a two step gradient method with backtracking line search. Using a 
shrinking variance and a convergence tolerance of 10~^, the mean relative error in terms of function value 
of VO compared to SLEP was 1.59 x 10^^. The mean of the relative distances L2{x — y)/L2{y) between 
each VO solution x and SLEP solution y was 0.0135. The mean CPU time for VO was 0.0947s and for 
the SLEP algorithm 0.0724^. Compared to other approaches, our method is very simple to implement 
and has good performance compared to the state of the art. 

"^Thc improvement of VO over other smoothing approaches is most likely accounted for by implementation differences 
and our reduction of a at each iteration. 

^SLEP is implemented in C, so the comparison of speed with our MATLAB implementation is not definitive. 
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3 Soft Margin Support Vector Machines 



The support vector machine (SVM) is a raethod for finding the optimal separating hyperplane of set of N 
data points x" with labels y" = ±1. This separating hyperplane may be in the space of the data points, 
or in a higher dimensional feature space if linear separation is not possible. In this second case, the data 
points are mapped to the feature space with a function ip{x^). When the data are not separable the soft 
margin support vector machine is used. This adds slack variables ^" which allow misclassified data points 
at a cost controlled by a variable C . The complete soft margin support vector machine has primal form: 



N 



min w^w + C \^ f " 

to, 6.? ^ 
n—1 

subject to y''{w'^ip{x'') +b)>l-CC and C > 0. 



(24) 
(25) 



This is classically solved by the method of Lagrange multipliers [5]. This yields a convex, constrained 
quadratic program referred to as the dual problem. The dual form is often preferred because it provides 
a convenient way to deal with the constraints, and because kernel methods can be easily applied. 

As an alternative, the primal problem can be re-expressed as a convex, unconstrained problem (see, for 
example, |10)). To remove the constraints from the primal problem, note that the objective is minimized 
when the slack variables ^" are as small as possible without violating the constraints. This is achieved by 
setting 5" = max{y"(w^(^(x") + 6) — 1, 0}, yielding the objective to be minimized: 



N 



C max{l - 7;"(w'r(p(x") + 6), 0}. 



(26) 



These slack variable terms are hinge loss penalties on misclassified data points. The hinge loss is convex 
in w and 5, thus the whole objective (a sum of hinge losses plus a convex quadratic term) is convex. It has 
discontinuities across each of the hyperplanes w'^ip{x^^) + b = y"". To allow feature mapping we must solve 
the primal problem in a reproducing kernel Hilbert space without using Lagrange multipliers. Without 
loss of generality, we may assume that the solution w can be expressed as 



N 



(27) 



which gives rise to the kernelized objective function: 



N 



N 



fil3, b) = (5'^KI3 + C ^ max{l - Knmr - hy'\0} 



(28) 



n=l 



■m — 1 



where the kernel matrix is given by if„m = y^y"^ip{x"')^ (p{x"^). This kernelized problem is convex; 
however it is discontinuous and therefore not directly amenable to gradient descent methods. 

The variational bound is the expectation of / with respect to distributions over (3 and b. We choose 
again Gaussian distributions over the parameters: /? distributed with mean fip and covariance a^I and b 
independently with mean /ifc and variance a^. This gives the bound /* < E with 



N 



E = (/(/?, 6))p(;3,6|^„^,) = ^i'pK^ip + trace((7^if ) + C ^[v^<j, [ — ] + 



5(7^) 



27r 



(29) 



where 



N 

V K 



\ 



N 



(30) 
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The gradients used in optimizing the variational objective can then also be calculated explicitly: 

P m—1 m—1 



gjn - E 2^-™^" - ^ E ) (31) 



^E^4-) (32) 



db V 

m—1 ^ 

We can also bound the difference between the variational and original objectives, just as we did in the 
lasso case. The difference between the bound and the function is given by 

Eifip^fih)- f{f^p,f^b)= trace {a^K)+cY, ) + ^""^ -maxK,0}j (33) 

The terms of the sum are each maximized at i/" = 0. i^" are not independent, so these conditions are 
unlikely to occur simultaneously, but we can bound the difference by 

^ n 

Eifi^, fih) - f{f^p,f^b) < trace (a^K) + C V (34) 

To guarantee to find a surrogate SVM objective whose optimum is within a specified function tolerance 
A f , one must thus satisfy 



- < ' fJf^+A,trace(^ - ^] (35) 
- y/ trace {K) \ \ \ Stt ^ ^ 7 i ^ ^ 



where 



N 



N 



1 + E ^nm- (36) 



m—1 



3.1 Chapelle's smoothing method 

Chapelle [10] formulates the problem as in Equation ([28]), though allowing for any form of loss function L 
rather than just the hinge loss. 

N N 

Ech.pMl3, b) = + c ^ L(l - ^ rKnrn " &2/") (37) 

n— 1 m—1 

He then derives Newton methods with quadratic loss and, more relevant to our problem, the Huber loss: 



if 1^1 < h 



iHubcr(0 - <^ - I'^l - (38) 

I max{^, 0| otherwise 

The form of the Huber loss can be seen in Figure ([3]). It is an upper bound on the hinge loss, /* < i^chapono, 
and becomes tighter for smaller h. This bound can be used in the same way as the variational bound, 
giving a smoothed approximation to the SVM objective. As with VO, the objective is a sum of convex 
terms and is thus itself convex. The Hessian of -Echapciic is given by 

^-1 a^gchapciio a'^gchapciic ^^^l i^/?! ^K + KI^K ' ^"^^^ 



dpab 902 
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Figure 3: (a) hinge loss and VO smoothed hinge loss; (b) Huber loss; (c) smoothed hinge loss gradient; 
(d) Huber loss gradient. 



where 1 is a vector of ones of dimension N and is a diagonal matrix indicating whether a given data 
point is in the quadratic portion of the loss function: 

\ otherwise ^ ' 

Given that we wish to optimize the hinge loss, we must choose h to be small. The quadratic region in 
the Huber loss is then small and as a consequence it is likely that for some trial solutions no training points 
will lie in this quadratic portion. This results in a value of zero for all Hessian elements corresponding 
to the offset (all blocks in the black matrix in Eguation ip^ except the bottom right). The Hessian 
is therefore non-invertible and Chapelle's Newton update cannot be evaluated. If instead the L-BFGS 
method is used [TT] it can easily be shown that the approximate Hessian is always positive definite and 
thus Newton-like performance can be achieved without risking invalid updates. This also means the cubic 
cost of inversion in the full Newton method is avoided. 



3.2 SVM Experiments 

Synthetic problems were constructed by generating a random separation vector of length 3.5 and di- 
mension 100. The length of this vector was chosen to give a problem which could be mostly, but not 
completely, separated using a linear kernel. The positively labelled points were then distributed with a 
standard multivariate normal about this separation vector and the negatively labelled points with the 
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Figure 4: A plot of the log of approximate solution time for SVM problems of varying size. For smaller 
problems SMO is the fastest method, but for larger ones it is beaten by variational optimization. 

same distribution about the origin. The objective we used had a cost coefficient C = 10. We define ap- 
proximate convergence as finding a function value with relative error ( -^ ^■^ ) of less than 0.1% compared 
to the function value /* of the MATLAB quadratic program on the dual when run until completion. For 
each of 50 training sets for each training set size, each method was given the same initial solution (a vector 
generated from a standard normal distribution) and run until approximate convergence. 

We compared VO to a modification of Chapelle's method (with and without shrinking the Huber 
parameter), sequential minimal optimization (SMO) and quadratic programming on the dual problem. 
We applied Chapelle's method by optimizing a Huber loss smoothed function using the L-BFGS imple- 
mentation in the minFunc package. To ensure comparability we treated VO in the same way, with the 
additional consideration of variance reduction at each iteration. We shrank the variance by a factor of 
10 every 250 iterations, from an initial value of 0.001. The Huber loss function was similarly shrunk by a 
factor of 10 every 250 iterations, but from an initial value of 10. SMO 12' solves the dual by analytically 
solving a sub-problem in two components and iterating until convergence. We used the implementation 
in the BRML toolbox [T3], which is based on the working set selection proposed by [13]. The dual was 
solved using the MATLAB built in function, quadprog. 

Figure (|4]) shows VO scaling better than both quadratic programming and SMO with respect to the 
number of training points. SMO is highly effective for smaller problems. This makes sense in the context 
of the findings of [T2^; for sparse problems (since they have fewer support vectors) SMO is extremely fast. 
What is perhaps more surprising is that VO outperforms the Chapelle method (without shrinking) roughly 
by a constant factor. This outperformance doesn't come from faster evaluation of the gradient (quite the 
opposite; evaluation of a Huber loss gradient takes on average 0.0102s on our machine compared to 
0.0237s for the gradient of the variational objective). Thus, we must conclude that there is some inherent 
advantage in the variance shrinking schedule which allows quicker convergence. 

The advantage of shrinking the variance (or Huber parameter) can be explained by imagining the 
impact of updates. Figure ([3]) shows the function value and gradient of both the Huber loss and VO 
smoothed loss. L-BFGS updates use an approximate Hessian, rather than the exact Hessian of a Newton 
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update, but the behaviour is similar. These Newton type updates will fall further from optima when 
the Hessian varies more quickly. The figure makes it clear that the Hessian changes more quickly in 
the solution space when the variance is larger. Thus large variance helps to avoid slow convergence by 
minimizing overshooting of updates. 

The impact of increasing the Huber parameter is similar, increasing the region about zero for which 
the gradient is constant. It would seem that so long as the initial value of h is sufficiently large, the same 
benefit should be found as with VO; and indeed this is borne out by the results in Figure 



4 Relation to Other Optimization Methods 

To our knowledge, relatively little attention has been given to the method of variational optimization and 
its relation to other approaches. In particular, the use of analytic averages to calculate gradients seems 
to be less considered. 

In [151 11^1 Berny considers a form of VO for general binary optimization Xi e {0, 1} through the use 
of a factorized distribution p{x\9) — Hj ^f' ~ ^0 contrast to analytical VO, in Berny's work 

samples are drawn from p{x\9) to approximate the expectations, rather than averages being computed 
analytically. Analytical bounds for binary problems are often easy to compute. Consider the binary 
quadratic program, in which one optimizes 

f{x) = x'^Ax + b^x (41) 

where x is constrained on the binary hypercube Xi e {0, 1}. The variational lower bound on the optimal 
solution is given by 

E{e) EE (x'^Ax + (42) 

One may then proceed to optimize E(6), 9i e [0, 1], using any method of choice. However, we have not 
given this problem further consideration in this paper since it gives rise to non-convex problems and as 
such the quality of the solution is entirely determined by the specific optimization algorithm used. Note 
also that the binary quadratic programme is in general NP-hard, so that it would be unreasonable to 
expect any single algorithm to perform well for arbitrary A. Whilst a numerical analysis of the different 
algorithms for various settings of A would be interesting, it goes beyond the scope of the current paper. 



4.1 Relation to smoothing 

Smoothing methods (see for example [T]) replace an objective f{x) by a 'smoother' differentiable version 
g{x), with the property that g(x) will be easier to optimize than f{x). Typically smoothing methods have 
a parameter that controls how to interpolate between f{x) and smoother versions thereof. Whilst VO 
also makes a smooth objective, it is not formally the same as a classical smoothing approach since, for 
the VO objective E{6), 6 inhabits a potentially different parameter space than x. 



4.2 Relation to relaxation 

In solution set relaxation methods the constraint x e C is replaced by a relaxed constraint 2? D C. Such 
relaxations provide an upper bound on /* . That is 

(/(a;))p(,|e) < max/(a;) < max/(x) (44) 

This means that we may bracket the optimum by the combined use of relaxation and variational optimiza- 
tion. Note that VO provides a lower bound on the optimum for all values of the variational parameter 9, 
whilst the relaxation is not guaranteed to provide a bound for all x £ V. 
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4.3 Free energy approach 



Defining tlie distribution 



Zj 



(45) 



where Z normalizes p, then we can find an approximation to v{x) based on minimising 



KL(g|p) = {\ogvix\Q)) 



p{x\B) 




+const. 



(46) 



E(9) 



Typically the family of distributions p{x\0) is chosen to ensure that E{9) is tractably computable. For 
a fixed /3, one then minimizes E(0). Gradually, as /3 is increased towards infinity, the distribution p{x) 
becomes more sharply peaked around its maximal state, so that the approximating distribution p{x\9) 
will also tend to this state. This method is quite general and applicable to either discrete or continuous 
optimization problems, see for example [17,. 

Whilst the method does not appear to provide a bound on /* in general, there are cases in which 
this approach does provide a bound. In particular, if the entropic term ([ogp(x\9)) ^^^^g^ is constant with 
respect to 9, then minimising E(9) is equivalent to maximising {fix))p(^^^gy In this case, the expectation 
is a bound on the optimum /*. For example, for Gaussian p{x\9) = A/" (x|/i, S) the entropy is a function 
of the covariance S alone, so that for fixed covariance, free energy optimization is equivalent to optimizing 
the VO bound (using a fixed covariance and optimizing with respect to the mean). In general, however, 
the two approaches are different. 

4.4 Estimation of Distribution Algorithms 

Estimation of distribution algorithms (ED As) are a broad set of optimization algorithms for the problem 
max^ f{w). An EDA starts with a prior distribution po{w) over the solution space. At each iteration this 
is then in used to generate a new set of solutions {w"}. The function values /(w") of these solutions are 
calculated and the distribution for the next iteration is constructed by some function (whose specification 
determines the type of EDA in question): pt+i{w) — F(^pt{w),{f{w")},{w^}). If VO were used, with 
expectations (/(w))p(i(,|e) found approximately by sampling, it would be an example of an EDA. The term 
was first used in the field of bioinformatics [TQ ■ ED As have been widely developed by the evolutionary 
computing community [19] and applied to both discrete (for example [20]) and continuous [21] problems. 

5 Conclusion 

We have considered a variational optimization method which is applicable to problems whose properties 
normally prohibit gradient based algorithms. We have shown that the variational objective can be differ- 
entiable for non-differentiable and discontinuous objectives. The experimental results show the ability of 
the method to provide good approximate solutions to the lasso and soft margin support vector machine 
problems. It guarantees a bound on the solution, and can conserve convexity properties of the objective. 
The bound is interesting since the procedure to find a bound on the optimum value is straightforward, 
even in cases where a direct bound on the objective function itself is not apparent. Note that in our two 
example cases, computing the expectation over the parameters could be carried out analytically. In cases 
where the expectation can't be performed analytically, provided a distribution p{x\9) is chosen for which 
samples can be drawn easily, then an approximate version of VO can be readily obtained numerically. 
Our description also makes a relation to alternative sampling based EDA procedures which can be viewed 
as special cases of VO and as such may motivate those methods as providing approximate bounds on the 
optima of the objective. 

In summary it is our view that the simplicity and generality of variational optimization lends itself to 
further study, with the potential for application to other interesting objectives. 
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